The life expectancy data is from kaggle, https://www.kaggle.com/kumarajarshi/life-expectancy-who. It contains 193 countries data. There are 22 columns in the dataset. All predicting variables was then divided into several broad categories:Immunization related factors, Mortality factors, Economical factors and Social factors.Through the analysis, we want to answer the following questions:
## need to do: some null value can be replaced with mean, some population can be obtained from internet,do not need
## to delete all with null value, need to do it later if have time.
## normalized data
## visualize data: country etc.
## use statistic model
## check random forest
from google.cloud import storage
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
lifeExpec = pd.read_csv('gs://life2/LifeExpectancyData.csv', sep=",")
lifeExpec.head()
lifeExpec.shape
lifeExpec.rename(columns={" BMI ":"BMI","Life expectancy ":"Life_Expectancy","Adult Mortality":"Adult_Mortality",
"infant deaths":"Infant_Deaths","percentage expenditure":"Percentage_Exp","Hepatitis B":"HepatitisB",
"Measles ":"Measles"," BMI ":"BMI","under-five deaths ":"Under_Five_Deaths","Diphtheria ":"Diphtheria",
" HIV/AIDS":"HIV/AIDS"," thinness 1-19 years":"thinness_10to19_years"," thinness 5-9 years":"thinness_5to9_years","Income composition of resources":"Income_Comp_Of_Resources",
"Total expenditure":"Tot_Exp"},inplace=True)
lifeExpec.columns
lifeExpec.isnull().sum()
country_list = lifeExpec.Country.unique()
fill_list = ['Life_Expectancy','Adult_Mortality','Alcohol','HepatitisB','BMI','Polio','Tot_Exp','Diphtheria','GDP','Population','thinness_10to19_years','thinness_5to9_years','Income_Comp_Of_Resources','Schooling']
for country in country_list:
lifeExpec.loc[lifeExpec['Country'] == country,fill_list] = lifeExpec.loc[lifeExpec['Country'] == country,fill_list].interpolate()
# Drop remaining null values after interpolation.
lifeExpec.dropna(inplace=True)
lifeExpec.shape
countryNames = lifeExpec.Country.unique()
countryNames.size
life = lifeExpec.drop('Year', axis = 1)
decribe = life.describe()
decribe.round(2)
round(lifeExpec[['Status','Life_Expectancy']].groupby(['Status']).mean(),2)
## life Expectancy vs status
l =(round(lifeExpec.groupby('Status')['Life_Expectancy'].mean(), 2).to_numpy())
plt.figure(figsize=(6,6))
plt.bar(lifeExpec.groupby('Status')['Status'].count().index,lifeExpec.groupby('Status')['Life_Expectancy'].mean())
plt.xlabel("Status",fontsize=12,fontweight='bold')
plt.ylabel("Avg Life Expectancy",fontsize=12,fontweight='bold')
plt.ylim(0, 85)
plt.title("Life Expectancy vs Status", fontsize=15,fontweight='bold')
for i, v in enumerate(l):
plt.text(i +0.05,v +2,str(v), color='blue', fontweight='bold' )
plt.show()
stats.ttest_ind(lifeExpec.loc[lifeExpec['Status']=='Developed','Life_Expectancy'],lifeExpec.loc[lifeExpec['Status']=='Developing','Life_Expectancy'])
## check outliers using boxplot
col_dict = {'Life_Expectancy':1,'Adult_Mortality':2,'Infant_Deaths':3,'Alcohol':4,'Percentage_Exp':5,'HepatitisB':6,'Measles':7,'BMI':8,'Under_Five_Deaths':9,'Polio':10,'Tot_Exp':11,'Diphtheria':12,'HIV/AIDS':13,'GDP':14,'Population':15,'thinness_10to19_years':16,
'thinness_5to9_years':17,'Income_Comp_Of_Resources':18,'Schooling':19}
plt.figure(figsize=(20,30))
for variable,i in col_dict.items():
plt.subplot(5,4,i)
plt.boxplot(lifeExpec[variable],whis=1.5)
plt.title(variable)
plt.show()
The boxplot results shows there are outliers exist in each variables. Next, we will take a look at each indivual variable, to see if we need to normalize the variables.
## check life Expecancy
lifeExpec[lifeExpec["Life_Expectancy"] < 45]
lifeExpec[lifeExpec["Country"] == 'Sierra Leone']['Life_Expectancy']
lifeExpec[lifeExpec["Country"] == 'Malawi']['Life_Expectancy']
lifeExpec[lifeExpec["Country"] == 'Lesotho']['Life_Expectancy']
lifeExpec[lifeExpec["Country"] == 'Zambia']['Life_Expectancy']
lifeExpec[lifeExpec["Country"] == 'Zimbabwe']['Life_Expectancy']
Based on the analysis above, the life Expectancy low is mainly because these data are from developing country. Here we will keep the data.
## check Adult_Mortality
lifeExpec[lifeExpec["Adult_Mortality"] > 500]
The boxplot shows the data the mortality in some countries, such as Zimbabwe, Zambia, Lesotho, Botswana,Malawi are all high. That may be caused by some factors. We will do specific analysis on these countries to see if we can dig out some important information. Here, we mainly investigate Central African Republic, Eritrea,Haiti, Sierra Leone to see if the mortality is abnormal.
lifeExpec[lifeExpec["Country"] == 'Central African Republic']
Adult Mortality in Central African Republic looks very abnormal, the Adult Mortality between 2000 and 2003 are very low, arount 50, however, after 2004, it increased dramatically. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA?locations=CF, the Adult Mortality is always above 390. So the Adult Mortality in 2000, 2001, 2002, 2003 and 2006 are not correct. Here, the Adult Mortality of those years will be modified based on world bank data.
## make a copy
lifeExpecCopy = lifeExpec
## 2000
lifeExpec.loc[527,'Adult_Mortality'] = 540.27
## 2001
lifeExpec.loc[526,'Adult_Mortality'] = 548.42
## 2002
lifeExpec.loc[525,'Adult_Mortality'] = 556.57
##2003
lifeExpec.loc[524,'Adult_Mortality'] = 548.02
## 2006
lifeExpec.loc[521,'Adult_Mortality'] = 522.39
lifeExpec[lifeExpec["Country"] == 'Eritrea']
The Adult_Mortality of Eritrea in 2005 is abnormal. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA, the Adult Mortality is 202.96. So we change the value of the Adult_Mortality of Eritrea in 2005 into 202.96.
##2005 Eritrea
lifeExpec.loc[860,'Adult_Mortality'] = 202.96
lifeExpec[lifeExpec["Country"] == 'Haiti']
From above chart, we could see, the Adult_Mortality in Haiti in 2010,and from 2000 to 2006 are abnormal. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA?locations=HT, the Adult Mortality is decreasing, but the the Adult_Mortality is still above 250. So we need to modify the data based on world bank.
## 2000
lifeExpec.loc[1137,'Adult_Mortality'] = 337.64
## 2001
lifeExpec.loc[1136,'Adult_Mortality'] = 336.45
## 2002
lifeExpec.loc[1135,'Adult_Mortality'] = 335.25
##2003
lifeExpec.loc[1134,'Adult_Mortality'] = 329.85
## 2004
lifeExpec.loc[1133,'Adult_Mortality'] = 324.45
##2005
lifeExpec.loc[1132,'Adult_Mortality'] = 319.05
## 2006
lifeExpec.loc[1131,'Adult_Mortality'] = 313.65
## 2010
lifeExpec.loc[1127,'Adult_Mortality'] = 293.38
lifeExpec[lifeExpec["Country"] == 'Sierra Leone']
From above chart, we could see, the Adult_Mortality in Sierra Leone in 2003, 2005, 2007, 2013 are abnormal. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.AMRT.MA, the Adult_Mortality is still above 400. So we need to modify the data based on world bank.
## modify the Adult_Mortality in Sierra Leone
## 2003
lifeExpec.loc[2309,'Adult_Mortality'] = 511.26
## 2005
lifeExpec.loc[2307,'Adult_Mortality'] = 483.17
## 2007
lifeExpec.loc[2305,'Adult_Mortality'] = 455.08
##2013
lifeExpec.loc[2299,'Adult_Mortality'] = 409.65
lifeExpec[lifeExpec["Country"] == 'Sierra Leone']
Infant death are the number of Infant Deaths per 1000 population. Based on the boxplot, we can see there is some value greater than 1000. These values must be not correct. The details shows below.
lifeExpec[lifeExpec["Infant_Deaths"] > 750]
The above shows that the infant death in India is greater than 1000. Based on data from world bank, https://data.worldbank.org/indicator/SP.DYN.IMRT.IN?locations=IN, the Adult_Mortality is below 400. So, we need to modify the infant death rate in India.
## 2000
lifeExpec.loc[1201,'Infant_Deaths'] = 66.7
## 2001
lifeExpec.loc[1200,'Infant_Deaths'] = 64.4
## 2002
lifeExpec.loc[1199,'Infant_Deaths'] = 62.2
##2003
lifeExpec.loc[1198,'Infant_Deaths'] = 60
## 2004
lifeExpec.loc[1197,'Infant_Deaths'] = 57.8
## 2005
lifeExpec.loc[1196,'Infant_Deaths'] = 55.7
## 2006
lifeExpec.loc[1195,'Infant_Deaths'] = 53.7
##2007
lifeExpec.loc[1194,'Infant_Deaths'] = 51.6
## 2008
lifeExpec.loc[1193,'Infant_Deaths'] = 49.4
## 2009
lifeExpec.loc[1192,'Infant_Deaths'] = 47.3
## 2010
lifeExpec.loc[1191,'Infant_Deaths'] = 45.1
##2011
lifeExpec.loc[1190,'Infant_Deaths'] = 43
##2012
lifeExpec.loc[1189,'Infant_Deaths'] = 40.9
## 2013
lifeExpec.loc[1188,'Infant_Deaths'] = 38.8
## 2014
lifeExpec.loc[1187,'Infant_Deaths'] = 36.9
lifeExpec[lifeExpec["Alcohol"] > 16]
It seems that there is only two countries whose alcohol values is higher,so a investigatation will be done below.
lifeExpec[lifeExpec["Country"] == 'Belarus']["Alcohol"]
lifeExpec[lifeExpec["Country"] == 'Estonia']["Alcohol"]
The data here looks a ittle bit sketptical, but without concrete reason, no further step is taken at this moment.
The lifeExpec data looks skeptical based on the boxplot and analysis below. As discribe on the website, Expenditure on health as a percentage of Gross Domestic Product per capita(%). For now, just keep the data here, further investigation will be done.
lifeExpec["Percentage_Exp"].max()
lifeExpec["Percentage_Exp"].min()
lifeExpec[lifeExpec["HepatitisB"] < 5].head()
HepatitisB: is Hepatitis B (HepB) immunization coverage among 1-year-olds, the HepatitisB rate lower than 5% are from developed country, so we will leave the data as original.
lifeExpec[lifeExpec["Measles"] > 30000]
lifeExpec["Measles"].max()
As the more Measles cases are in developing country, without further evidence that Measles is not correct, we will keep the data as original.
In the beggining, we analyze whether developed country or developing country, it make different with the Life_Expectancy. Next we will analyze other factors that could affect the life expendency. Let's see the heat map.
corr = lifeExpec.corr()
ax = plt.axes()
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
ax.set_title("Heat Map",fontsize = 20,fontweight='bold')
From the heat map above, we could see life Expectancy has negative correlation with Adult_Mortality, Infant_Deaths, Measles, HIV/AIDS, under-five-year death, thinness 1-19 years, thinness 5-9 years. Life Expectancy has positive correlation with alcohol, percentage_Exp, BMI,Diphtheria,GDP,income_comp_Of_Resource, and Schoolng. For poio, population, the correlation is so small, we could not say it's negative or positive. It worth to mention that life expectancy has high correlation with adult_mortality, BMI, Diphtheria, HIV/AIDS,thinness_1-19_years, thiness_5_to_9_years, income_of_resource and schooling. Further analysis is needed to gain more information.
Next we will investigate factors one by one, to see which factor will affect life expectancy and how it will affect. To dive more, the factors will be divided into immunization factors, mortality factors, economic factors, social factors and other health related factors.
## divide country into developed and developing counties
lifeExpecDevelping = lifeExpec[lifeExpec["Status"] == "Developing"]
lifeExpecDevelped = lifeExpec[lifeExpec["Status"] == "Developed"]
lifeExpecDevelping.shape
lifeExpec['Life_Expectancy'].corr(lifeExpec['Adult_Mortality'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Adult_Mortality'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Adult Mortality', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Adult Mortality", fontsize = 20,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Adult_Mortality'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Adult_Mortality'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Adult_Mortality', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Adult_Mortality in Developed Countries", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Adult_Mortality'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Adult_Mortality'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Adult_Mortality', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Adult_Mortality", fontsize = 15,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Above, the pearson correlation is used to test the correlation between Life Expectancy and Adult Mortality. The correlation is -0.7028. That means, the higher the mortality in a country, the shorter life expectancy. This applies to both developed countries and developing countries.
lifeExpec['Life_Expectancy'].corr(lifeExpec['Infant_Deaths'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Infant_Deaths'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Infant Death', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Infant Death", fontsize = 20,fontweight='bold')
From above analysis, the correlation between life Expectancy and Infant Death is relatively low, -0.3007. That means, the higher life expectancy, the relatively lower infant death rate. But the correlation is low. There is an interesting phenomenia observed from correlation graph. For developed country, there is almost no relationship between life expectancy and infant death. Only for developing country the infant death may have low relationship with life expectancy. Further analysis are done below.
## divide country into developed and developing counties
lifeExpecDevelping = lifeExpec[lifeExpec["Status"] == "Developing"]
lifeExpecDevelped = lifeExpec[lifeExpec["Status"] == "Developed"]
lifeExpecDevelping.shape
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Infant_Deaths'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Infant Death', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Infant Death", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Infant_Deaths'])
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Infant_Deaths'])
Indeed, some countries, the life expectancy has high correlation with infant death. For these country, lower the infant death could be one factor increase the whole country life expectancy.
lifeExpec['Life_Expectancy'].corr(lifeExpec['Under_Five_Deaths'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Under_Five_Deaths'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Under Five Deaths', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Under Five Deaths", fontsize = 20,fontweight='bold')
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Under_Five_Deaths'])
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Under_Five_Deaths'])
There is very low relationship with life expectancy and under five death. For developed countries, there is no correlation between life expectancy and under five death.
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Under_Five_Deaths'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Under Five Death', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Under Five Death", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
Only for a few counties, life expectancy and under five death have higher correlations. For these countries, lower the under five death could improve the life expectancy. We can do further analysis on these countries.
lifeExpec['Life_Expectancy'].corr(lifeExpec['HepatitisB'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['HepatitisB'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('HepatitisB', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs HepatitisB", fontsize = 20,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['HepatitisB'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['HepatitisB'])
lifeExpec['Life_Expectancy'].corr(lifeExpec['Polio'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Polio'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Polio', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Polio", fontsize = 20,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Polio'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Polio'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Polio'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Polio', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Polio", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['Diphtheria'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Diphtheria'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Diphtheria', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Diphtheria", fontsize = 20,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Diphtheria'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Diphtheria'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Diphtheria'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Polio', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Diphtheria", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['HIV/AIDS'])
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['HIV/AIDS'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['HIV/AIDS'])
lifeExpec['Life_Expectancy'].corr(lifeExpec['Percentage_Exp'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Percentage_Exp'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Percentage Expenditure', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Percentage Expenditure", fontsize = 15,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Percentage_Exp'])
lifeExpec['Life_Expectancy'].corr(lifeExpec['Tot_Exp'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Tot_Exp'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Total Expediture', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Total Expenditure", fontsize = 15,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Tot_Exp'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Tot_Exp'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Tot_Exp'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Total Expediture', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Total Expediture in Developed Countries", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['GDP'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['GDP'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('GDP', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs GDP", fontsize = 20,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['GDP'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['GDP'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['GDP'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('GDP', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs GDP in Developed Countries", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x= lifeExpecDevelping['GDP'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('GDP', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs GDP in Developing Countries", fontsize = 20,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['Income_Comp_Of_Resources'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y= lifeExpec['Life_Expectancy'],x = lifeExpec['Income_Comp_Of_Resources'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 10,fontweight='bold')
plt.xlabel('Income Composition of Resources', fontsize = 10,fontweight='bold')
ax.set_title("Life Expectancy vs Income Composition of Resources", fontsize = 12,fontweight='bold')
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Income_Comp_Of_Resources'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Income Composition of Resources', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Income Composition of Resources in Developed Countries", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Income_Comp_Of_Resources'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Income Composition of Resources', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Income Composition of Resources in Developed Countries", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Income_Comp_Of_Resources'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Income_Comp_Of_Resources'])
lifeExpec['Life_Expectancy'].corr(lifeExpec['Schooling'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y= lifeExpec['Life_Expectancy'],x = lifeExpec['Schooling'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 10,fontweight='bold')
plt.xlabel('Schooling', fontsize = 10,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling", fontsize = 15,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Schooling'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Schooling'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['Schooling'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Schooling', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['Population'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y= lifeExpec['Life_Expectancy'],x = lifeExpec['Population'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 10,fontweight='bold')
plt.xlabel('Population', fontsize = 10,fontweight='bold')
ax.set_title("Life Expectancy vs Population", fontsize = 15,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Population'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Population'])
lifeExpec['Life_Expectancy'].corr(lifeExpec['Measles'])
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Measles'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Measles'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['Alcohol'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Alcohol', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Alcohol", fontsize = 20,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['Alcohol'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['Alcohol'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Schooling'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Schooling', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling in Developed Countries", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['BMI'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['BMI'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('BMI', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs BMI", fontsize = 20,fontweight='bold')
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['BMI'])
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['BMI'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelping, y = lifeExpecDevelping['Life_Expectancy'], x = lifeExpecDevelping['BMI'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('BMI', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs BMI in Developing Countries", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['thinness_10to19_years'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['thinness_10to19_years'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('thinness 10 to 19 years', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs thinness 10 to 19 years", fontsize = 15,fontweight='bold')
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['thinness_10to19_years'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['thinness_10to19_years'])
ax = plt.axes()
sns.scatterplot(data = lifeExpecDevelped, y = lifeExpecDevelped['Life_Expectancy'], x = lifeExpecDevelped['Schooling'],hue = "Country")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('Schooling', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs Schooling in Developed Countries", fontsize = 12,fontweight='bold')
plt.legend(loc='upper left', bbox_to_anchor=(-0.2, -0.06),fancybox=True, shadow=True, ncol=5)
lifeExpec['Life_Expectancy'].corr(lifeExpec['thinness_5to9_years'])
ax = plt.axes()
sns.scatterplot(data = lifeExpec, y = lifeExpec['Life_Expectancy'], x = lifeExpec['thinness_5to9_years'],hue = "Status")
plt.ylabel('Life Expectancy', fontsize = 15,fontweight='bold')
plt.xlabel('thinness 5 to 9 years', fontsize = 15,fontweight='bold')
ax.set_title("Life Expectancy vs thinness 5 to 9 years", fontsize = 15,fontweight='bold')
lifeExpec['Life_Expectancy'].corr(lifeExpec['HIV/AIDS'])
lifeExpecDevelped['Life_Expectancy'].corr(lifeExpecDevelped['HIV/AIDS'])
lifeExpecDevelping['Life_Expectancy'].corr(lifeExpecDevelping['HIV/AIDS'])
# Life_Expectancy through years
plt.figure(figsize=(6,6))
plt.bar(lifeExpec.groupby('Year')['Year'].count().index,lifeExpec.groupby('Year')['Life_Expectancy'].mean(),color='cornflowerblue',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy vs Year",fontweight='bold')
plt.show()
# Life_Expectancy through years
plt.figure(figsize=(6,6))
plt.bar(lifeExpecDevelped.groupby('Year')['Year'].count().index,lifeExpecDevelped.groupby('Year')['Life_Expectancy'].mean(),color='cornflowerblue',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy in developed countries vs Year",fontweight='bold')
plt.show()
# Life_Expectancy through years
plt.figure(figsize=(6,6))
plt.bar(lifeExpecDevelping.groupby('Year')['Year'].count().index,lifeExpecDevelping.groupby('Year')['Life_Expectancy'].mean(),color='cornflowerblue',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy in developing countries vs Year",fontweight='bold')
plt.show()
## check correlation within features
lifeExpec['thinness_5to9_years'].corr(lifeExpec['thinness_10to19_years'])
lifeExpec['Income_Comp_Of_Resources'].corr(lifeExpec['Schooling'])
lifeExpec['Infant_Deaths'].corr(lifeExpec['Under_Five_Deaths'])
lifeExpec['GDP'].corr(lifeExpec['Percentage_Exp'])
## fit linear regression model
response = lifeExpecDevelped["Life_Expectancy"]
predictors = lifeExpecDevelped[['Adult_Mortality','Percentage_Exp', 'GDP','Income_Comp_Of_Resources','Schooling','Population','thinness_10to19_years']]
model =sm.OLS(response,predictors).fit()
##print (model.params)
print (model.summary())
response = lifeExpecDevelped["Life_Expectancy"]
predictors1 = lifeExpecDevelped[['Adult_Mortality', 'GDP','Population','thinness_10to19_years']]
model1 =sm.OLS(response,predictors1).fit()
print (model1.summary())
from sklearn.model_selection import train_test_split
trainDataDeveloped = lifeExpecDevelped.drop(["Country", "Year",'Status'],axis=1)
##XDeveloped = trainDataDeveloped.drop(['Life_Expectancy'],axis=1)
XDeveloped = lifeExpecDevelped[['Adult_Mortality','Percentage_Exp', 'GDP','Income_Comp_Of_Resources','Schooling','Population','thinness_10to19_years']]
yDeveloped = trainDataDeveloped["Life_Expectancy"]
X_train, X_test, y_train, y_test = train_test_split(XDeveloped, yDeveloped, test_size=0.30, random_state=101)
from sklearn.linear_model import LinearRegression
Linear_model= LinearRegression()
Linear_model.fit(X_train,y_train)
predictions1=Linear_model.predict(X_test)
Linear_model.coef_
lifeExpec.columns
trainData = lifeExpec.drop(["Country", "Year",'Status'],axis=1)
X = trainData.drop(['Life_Expectancy'],axis=1)
y = trainData["Life_Expectancy"]
trainData2 = lifeExpec.drop(["Country", "Year",'Status','Schooling','thinness_10to19_years'],axis=1)
X = trainData2.drop(['Life_Expectancy'],axis=1)
y = trainData2["Life_Expectancy"]
trainData2.columns
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
## normalize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(trainData2 )
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
trainDataDeveloped = lifeExpecDevelped.drop(["Country", "Year",'Status'],axis=1)
XDeveloped = trainDataDeveloped.drop(['Life_Expectancy'],axis=1)
yDeveloped = trainDataDeveloped["Life_Expectancy"]
## normalize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(trainDataDeveloped )
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(XDeveloped, yDeveloped, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
plt.figure(1)
feat_importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.title("Feature Importance in Developed Countries", fontWeight = 'bold')
trainDataDeveloping = lifeExpecDevelping.drop(["Country", "Year",'Status'],axis=1)
XDeveloping = trainDataDeveloping.drop(['Life_Expectancy'],axis=1)
yDeveloping = trainDataDeveloping["Life_Expectancy"]
## normalize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(trainDataDeveloping )
X_train, X_test, y_train, y_test = train_test_split(XDeveloping, yDeveloping, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
plt.figure(1)
feat_importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.title("Feature Importance in Developing Countries", fontWeight = 'bold')
## get the lifeexpectancy is lower than 50
lifeExpectLow = lifeExpec[lifeExpec["Life_Expectancy"] < 50]
len(lifeExpectLow["Country"].unique())
trainDataDevelopingLow = lifeExpectLow.drop(["Country", "Year",'Status'],axis=1)
XDevelopingLow = trainDataDevelopingLow.drop(['Life_Expectancy'],axis=1)
yDevelopingLow = trainDataDevelopingLow["Life_Expectancy"]
## normalize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(trainDataDevelopingLow )
X_train, X_test, y_train, y_test = train_test_split(XDevelopingLow, yDevelopingLow, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
# Get numerical feature importances
importances = list(regressor.feature_importances_)
feature_list = list(X.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
lifeExpectLow.describe()
plt.figure(1)
feat_importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.title("Feature Importance in low Life Expetancy Countries", fontWeight = 'bold')